
Add support for sourcing chain data from Electrum #486

Open

wants to merge 14 commits into main from 2025-02-add-electrum-support

Conversation

Collaborator

@tnull tnull commented Mar 17, 2025

Closes #196.

So far, we support the Esplora and bitcoind RPC chain sources. In this PR we add Electrum support based on the blocking rust-electrum-client and its bdk_electrum and lightning-transaction-sync counterparts.

Due to the blocking nature of rust-electrum-client, and since it connects to the server directly upon Client::new (see bitcoindevkit/rust-electrum-client#166), we ended up wrapping the runtime-specific behavior in an ElectrumRuntimeClient object that is initialized and dropped in Node::start and Node::stop, respectively.

One thing we still need to consider is how we'd reestablish connections to the remote after they have been lost for one reason or another. IMO, that behavior should live in rust-electrum-client to avoid all users having to duplicate it, so it's pending resolution of bitcoindevkit/rust-electrum-client#165.

As we did for bitcoind RPC, Electrum support is tested by adding another full_cycle integration test.
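The description above mentions wrapping the blocking client behind a runtime-aware object. As a rough, std-only sketch of the underlying pattern (offload the blocking call to a worker and enforce a timeout on the caller's side): `blocking_batch_call` and `call_with_timeout` are hypothetical stand-ins, and the actual ElectrumRuntimeClient uses Tokio's `spawn_blocking` and `tokio::time::timeout` rather than plain threads and channels.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for a blocking rust-electrum-client call.
fn blocking_batch_call() -> Result<Vec<u32>, String> {
    thread::sleep(Duration::from_millis(50)); // simulate network I/O
    Ok(vec![1, 2, 3])
}

// Offload the blocking call to a dedicated thread and enforce a timeout
// on the caller's side, mirroring the `spawn_blocking` +
// `tokio::time::timeout` pattern (std-only sketch, not the PR's code).
fn call_with_timeout(timeout: Duration) -> Result<Vec<u32>, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(blocking_batch_call());
    });
    match rx.recv_timeout(timeout) {
        Ok(res) => res,
        Err(_) => Err("timed out".to_string()),
    }
}

fn main() {
    assert_eq!(call_with_timeout(Duration::from_secs(1)), Ok(vec![1, 2, 3]));
    assert!(call_with_timeout(Duration::from_millis(1)).is_err());
    println!("ok");
}
```

Note that, as discussed further down in this thread, the timeout only unblocks the caller; the worker itself keeps running until the blocking call returns.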

@tnull tnull added this to the 0.5 milestone Mar 17, 2025
@tnull tnull force-pushed the 2025-02-add-electrum-support branch from b386ee7 to 9352b2e Compare March 17, 2025 14:17
@tnull tnull requested a review from jkczyz March 17, 2025 14:51
@tnull tnull force-pushed the 2025-02-add-electrum-support branch 3 times, most recently from 568d10d to 678cf91 Compare March 18, 2025 12:50
Contributor

@elnosh elnosh left a comment

I'm familiarizing myself with the codebase, but I tried running a node locally with the changes here pointing to an Electrum server, and basic operations are working. I just have a couple of questions, if I may, to get a better understanding.

Not strictly related to the PR, but maybe I can sneak it in: why is it that bitcoin core RPC and Electrum live in their separate bitcoind_rpc.rs and electrum.rs files, but for Esplora it is all in mod.rs? I see that the EsploraAsyncClient is directly used without a wrapper but wanted to see what the reasons for it were (:

@tnull
Collaborator Author

tnull commented Mar 19, 2025

Not strictly related to the PR but maybe I can sneak it in, why is it that for bitcoin core rpc and electrum they are in their separate bitcoind_rpc.rs and electrum.rs files but for esplora it is all in the mod.rs?

Hmm, for one there are historic reasons, but also bitcoind RPC and Electrum both ended up needing additional helper objects/types that made sense to split out into dedicated sub-modules. While we could eventually refactor the Esplora part so that more code lives in a dedicated esplora.rs file, there is no strong necessity for it right now, I think.

I see that the EsploraAsyncClient is directly used without a wrapper but wanted to see what the reasons for it were (:

Right, that's essentially the main reason why it still lives in mod.rs. For Electrum we ended up needing a wrapper that is only created once the runtime is available, and for bitcoind RPC we implemented a specific wrapper around RpcClient that essentially handles the JSON parsing.

@tnull tnull force-pushed the 2025-02-add-electrum-support branch 3 times, most recently from 9608e66 to 3142c41 Compare March 19, 2025 12:54
@tnull tnull force-pushed the 2025-02-add-electrum-support branch 4 times, most recently from 466ab9f to e39b1f2 Compare March 27, 2025 08:47
@tnull
Collaborator Author

tnull commented Mar 27, 2025

Blocked on rust-bitcoin/corepc#111 for now.

@tnull tnull force-pushed the 2025-02-add-electrum-support branch 2 times, most recently from 0d401fe to dda6bc7 Compare March 27, 2025 10:50
@jkczyz jkczyz removed their request for review March 27, 2025 18:42
@tnull tnull requested a review from joostjager March 27, 2025 19:04
@tnull tnull force-pushed the 2025-02-add-electrum-support branch from dda6bc7 to edb4b10 Compare March 28, 2025 08:55
@tnull
Collaborator Author

tnull commented Mar 28, 2025

Rebased on main after #508 landed.

Contributor

@joostjager joostjager left a comment

The question that I asked myself most during review is whether the way this PR deals with the runtime and start/stop is really the best option for now.

@tnull tnull force-pushed the 2025-02-add-electrum-support branch from edb4b10 to 97f4bd9 Compare March 28, 2025 16:21
@tnull tnull force-pushed the 2025-02-add-electrum-support branch 2 times, most recently from 2e12bf1 to be2496e Compare April 2, 2025 09:04
Collaborator Author

@tnull tnull left a comment

Addressed pending feedback and rebased on main to resolve minor conflict.

@tnull tnull requested a review from joostjager April 2, 2025 10:37
@tnull tnull force-pushed the 2025-02-add-electrum-support branch from be2496e to 7cb0d1f Compare April 2, 2025 13:10
Contributor

@joostjager joostjager left a comment

Still not all the way through, but posting this now.

My main structural comment concerns the many instances of matching on the backend, and whether those could be converted to separate objects: more OO in style, and more readable.


let spawn_fut = self.runtime.spawn_blocking(move || electrum_client.batch_call(&batch));

let timeout_fut = tokio::time::timeout(
Contributor

Is it really necessary to have these timeouts here, and is there nothing in electrum-client? It surprises me because apparently retries are implemented in the client code, but not timeouts?

Collaborator Author

There is a timeout parameter, but we previously learned on the Esplora side not to necessarily trust the timeout parameters of the libraries we use (namely, reqwest, which is used by esplora-client, showed delayed handling of concurrent requests when the futures were polled from different threads). So we introduce reliable timeouts on top of the library-provided ones as a belt-and-suspenders measure, to make sure we never get into a state where a misbehaving task can lock up all of our worker threads. I guess we could consider enabling the timeouts on electrum-client in addition to the ones we set, but I'm not sure coming up with yet another magic number is worth it?

Contributor

Isn't electrum different because it isn't an async library?

I think locking up worker threads can still happen because the async task may finish on the timeout, but the worker thread remains blocked? It isn't cancelled.

Collaborator Author

Okay. I now added a fixup that also sets the timeout parameter on the electrum_client side (where we can).

Contributor

@joostjager joostjager Apr 3, 2025

Is there a point though in having an async timeout wrapper if the spawned thread would hang anyway?

Collaborator Author

@tnull tnull Apr 3, 2025

I think so. For one, now that we set the timeouts, the spawned thread should hopefully return eventually. But even if it wouldn't for some reason, it would be preferable to at least try to continue (while risking eventually running out of blocking worker threads if the issue persists), instead of just locking up and not continuing with subsequent sync attempts.

Contributor

I see you also set retries now. I saw that it was 1 previously (the default). Did the earlier test with the wifi disconnect pass because that one retry happened after you already reconnected wifi?

If I understand this correctly, setting the timeout on the electrum client itself is essential to not end up with a potential graveyard of stuck threads? This might be important to add as a comment.

Also curious to hear why you think that retrying over and over again when a previous request is still stuck is a good idea? It will eventually exhaust the thread pool, right? And then stop independent processes within LDK too?

The original reason for the double timeout is because esplora-client turned out to be not trustworthy. But, as suggested above, isn't electrum different because it isn't an async library and it is unlikely for the timeout to fail? Looking at the sources, it seems that the timeout is set on a pretty low level for all calls.

Collaborator Author

I see you also set retries now. I saw that it was 1 previously (the default). Did the earlier test with the wifi disconnect pass because that one retry happened after you already reconnected wifi?

No, I think the calls were cancelled and we then simply continued on the next timer tick.

If I understand this correctly, setting the timeout on the electrum client itself is essential to not end up with a potential graveyard of stuck threads? This might be important to add as a comment.

I don't think it's super essential, as in most circumstances when the connection would fail, we'd still error out once the connection is dropped, IIUC. But yes, we probably don't want to lean too much on behavior that might very well be OS-specific.

Also curious to hear why you think that retrying over and over again when a previous request is still stuck is a good idea? It will eventually exhaust the thread pool, right? And then stop independent processes within LDK too?

Chain syncing and fee rate updates are critical for the operation of a Lightning node. When some process would really get stuck for an unexpected reason, we basically have two options: panic, hoping that a restart or operator intervention would fix the issue, or retry, risking that we might exhaust the thread pool eventually if the exact same unexpected behavior persists 512 times in a row. I think the latter is preferable.
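As a hedged sketch of that trade-off (retry rather than panic, at the risk of eventually exhausting the blocking pool), one could imagine a guard like the following. `MAX_IN_FLIGHT`, `try_spawn_sync`, and the counter are hypothetical illustrations, not part of this PR; the 512 figure is Tokio's default blocking-pool size.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Tokio's blocking pool defaults to 512 threads, hence the figure above.
const MAX_IN_FLIGHT: usize = 512;

// Hypothetical guard: keep retrying on each timer tick, but refuse to
// spawn yet another blocking task once too many earlier ones are stuck.
fn try_spawn_sync(in_flight: &Arc<AtomicUsize>) -> bool {
    if in_flight.load(Ordering::SeqCst) >= MAX_IN_FLIGHT {
        // The pool would be exhausted; skip this tick instead of panicking.
        return false;
    }
    in_flight.fetch_add(1, Ordering::SeqCst);
    // ... here one would spawn the blocking sync task, which decrements
    // the counter once the underlying call finally returns ...
    true
}

fn main() {
    let in_flight = Arc::new(AtomicUsize::new(0));
    for _ in 0..MAX_IN_FLIGHT {
        assert!(try_spawn_sync(&in_flight));
    }
    // With 512 earlier calls still stuck, the next attempt is skipped
    // rather than exhausting the pool.
    assert!(!try_spawn_sync(&in_flight));
    println!("guarded");
}
```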

The original reason for the double timeout is because esplora-client turned out to be not trustworthy. But, as suggested above, isn't electrum different because it isn't an async library and it is unlikely for the timeout to fail? Looking at the sources, it seems that the timeout is set on a pretty low level for all calls.

Yeah, tbh I mostly added it for the sake of uniformity across the different chain sources and as a general 'belt-and-suspenders' kind of thing. If we don't think 'belt-and-suspenders' is needed, we could consider dropping the timeouts everywhere and relying on the dependencies. I'm on the fence about going this way.

But, tbh, I currently don't fully see why being doubly cautious in this amalgamation of async runtimes and blocking threads is a bad thing, especially since the behavior of the dependency crates might very well (silently) change down the line, even if we review their code now and are positive that the double timeouts aren't required in one case or another. FWIW, there is a reason why we added them in the first place: users in production had their node lock up for a really long time due to completely unpredictable behavior arising from reusing reqwest clients across different thread contexts (reusing clients is explicitly endorsed in their docs, btw).

Contributor

@joostjager joostjager left a comment

A few final non-blocking things about code re-use.

@tnull tnull force-pushed the 2025-02-add-electrum-support branch 4 times, most recently from 96ee279 to cb5dc5b Compare April 3, 2025 12:51
@tnull
Collaborator Author

tnull commented Apr 3, 2025

Squashed fixups without further changes.

Contributor

@joostjager joostjager left a comment

Nearly there. Just the one question about the double timeouts.



tnull added 14 commits April 4, 2025 10:21
.. we need to bump the version, while making sure `electrsd` and CLN CI
are using the same version. As the `blockstream/bitcoind` version hasn't
reached v28 yet, we opt for v27.2 everywhere for now.
We upgrade our tests to use `electrum-client` v0.22 and `electrsd` v0.31
to ensure compatibility with the `bdk_electrum` we're about to start using.
By default `rustls`, our TLS implementation of choice, uses `aws-lc-rs`
which requires `bindgen` on some platforms. To allow building with
`aws-lc-rs` on Android, we here install the `bindgen-cli` tool before
running the bindings generation script in CI.
We here setup the basic API and structure for the
`ChainSource::Electrum`.

Currently, we won't have a `Runtime` available when initializing
`ChainSource::Electrum` in `Builder::build`. We therefore isolate any
runtime-specific behavior into an `ElectrumRuntimeClient`.

This might change in the future, but for now we do need this workaround.
Currently, we won't have a `Runtime` available when initializing
`ChainSource::Electrum`. We therefore isolate any runtime-specific
behavior into the `ElectrumRuntimeStatus`.

Here, we implement `Filter` for `ElectrumRuntimeClient`, but we need to
cache the registrations as they might happen prior to
`ElectrumRuntimeClient` becoming available.
.. as we do with `BitcoindRpc`, we now test `Electrum` support by
running the `full_cycle` test in CI.
@tnull tnull force-pushed the 2025-02-add-electrum-support branch from cb5dc5b to db756c6 Compare April 4, 2025 08:21
@tnull
Collaborator Author

tnull commented Apr 4, 2025

The corepc-node crate just shipped the needed fix as part of the 0.7 minor release; now we just need RCasatta/electrsd#100 to land as part of an electrsd release to move forward with this.

Successfully merging this pull request may close these issues.

Add Electrum support
4 participants